What is NumPy?

NumPy is short for "Numerical Python". It is the basis for nearly all scientific computing in the Python world. Its main contribution is the NumPy ndarray. (See "The NumPy array: a structure for efficient numerical computation", Van der Walt S. et al., 2011, available on arXiv.)
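For example, a minimal ndarray (illustrative values) is a fixed-type, n-dimensional block of numbers:

```python
import numpy as np

# An ndarray has a single element type (dtype) and an n-dimensional shape.
a = np.array([[1.0, 2.0], [3.0, 4.0]])
print(a.shape)  # (2, 2)
print(a.dtype)  # float64
```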

Why use NumPy?

Reason 1: it's memory efficient


In [ ]:
%run -i ../ipython_memory_usage/ipython_memory_usage.py

In [ ]:
long_list = list(range(int(1e7)))

In [ ]:
import numpy as np

(Due to what appears to be a bug in memory_profiler, you will need to run the cell below twice.)


In [ ]:
long_array = np.arange(1e7, dtype=int)

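To see roughly where the difference comes from, here is a back-of-the-envelope comparison (a sketch; the exact numbers vary by Python version and platform):

```python
import sys
import numpy as np

n = int(1e6)
long_list = list(range(n))
long_array = np.arange(n, dtype=np.int64)

# A list stores pointers to separate Python int objects; the array stores
# raw 8-byte integers in one contiguous block of memory.
list_bytes = sys.getsizeof(long_list) + sum(sys.getsizeof(x) for x in long_list)
array_bytes = long_array.nbytes

print(f"list:  ~{list_bytes / 1e6:.0f} MB")
print(f"array: ~{array_bytes / 1e6:.0f} MB")
```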
Reason 2: it's fast

We no longer need the memory footprint watcher, so let's turn it off using the stop_watching_memory function provided by the script we ran earlier. (This is all a bit magical; don't worry about the details.)


In [ ]:
stop_watching_memory(get_ipython())

Okay, let's time a few operations:


In [ ]:
%timeit sum(long_list)

In [ ]:
%timeit np.sum(long_array)

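If you don't have the %timeit magic handy, the same comparison can be sketched with plain time.perf_counter (smaller arrays here so it runs quickly; absolute timings will vary by machine):

```python
import time
import numpy as np

n = int(1e6)
long_list = list(range(n))
long_array = np.arange(n, dtype=np.int64)

t0 = time.perf_counter()
list_total = sum(long_list)
t_list = time.perf_counter() - t0

t0 = time.perf_counter()
array_total = int(np.sum(long_array))
t_array = time.perf_counter() - t0

# Same answer, very different speed: np.sum loops in compiled C code.
print(list_total == array_total)
print(f"sum(list):     {t_list:.4f} s")
print(f"np.sum(array): {t_array:.4f} s")
```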
Reason 3: it's multidimensional


In [ ]:
wide_array = np.arange(45).reshape(5, 9)
print(wide_array)

In [ ]:
deep_and_wide_array = wide_array.reshape((5, 3, 3))
print(deep_and_wide_array)

Some NumPy basics

Indexing

One of NumPy's strengths is its sophisticated indexing engine.


In [ ]:
deep_and_wide_array[0] # get the first "level"

In [ ]:
deep_and_wide_array[-1] # get the *last* "level"

In [ ]:
deep_and_wide_array[1:-1] # get all levels between the second (inclusive) and last (not inclusive)

In [ ]:
deep_and_wide_array[1, 2] # get the second level, third row

In [ ]:
deep_and_wide_array[:, :, 2] # get the third column of every level and row

In [ ]:
deep_and_wide_array[::2] # Get every other level

Exercise: Get the following slices out of the deep_and_wide_array:

  • every other row of the first level.
  • every other row and column of all levels.
  • the last row of every level.
  • the second-to-last level.

In [ ]:

Exercise: Can you name some type of data that you would store in a 1D array? 2D? 3D? 4D? 5D?

  • 1D data:
  • 2D data:
  • 3D data:
  • 4D data:
  • 5D data:

Boolean indexing

You can also access arrays using a boolean mask:


In [ ]:
small_nonzero_numbers = (deep_and_wide_array < 10) & (deep_and_wide_array > 0)
print(type(small_nonzero_numbers))
print(small_nonzero_numbers.dtype)

In [ ]:
deep_and_wide_array[small_nonzero_numbers]

Exercise: Use a boolean mask to find all data in the deep_and_wide_array that is greater than 30 and that is even.


In [ ]:

Memory layout

NumPy's multidimensional capability is provided by two properties of the array, shape and strides:


In [ ]:
print(deep_and_wide_array.shape)
print(deep_and_wide_array.strides)

It's natural to think of a 3D array as a box of little cubes. Each cube holds a number (our data). However, a computer's memory is more like a long line of little cubes (bytes). NumPy needs to know how to take that line of cubes and "think about it" as a box. That's what "shape" and "strides" are for.

The shape tells NumPy how many cubes are on each side of the box. The strides tell NumPy how far along the memory it needs to go to find the next little cube.
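We can check this arithmetic ourselves (a sketch, assuming the default C row-major memory order): element [i, j, k] lives at byte offset i*strides[0] + j*strides[1] + k*strides[2] from the start of the memory block.

```python
import numpy as np

a = np.arange(45, dtype=np.int64).reshape(5, 3, 3)
print(a.strides)  # (72, 24, 8) for C-ordered int64

# Byte offset of element [2, 1, 2] from the start of the buffer.
i, j, k = 2, 1, 2
offset = i * a.strides[0] + j * a.strides[1] + k * a.strides[2]
print(offset)                 # 184
print(offset // a.itemsize)   # 23, the flat index of a[2, 1, 2]
print(a.flat[offset // a.itemsize] == a[i, j, k])  # True
```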

That smallest stride is 8 because our array contains 64-bit integers, each of which occupies 8 bytes in memory. That's wasteful here: our numbers only go up to 44, so a single byte, 8 bits, would suffice. (8 bits can represent numbers up to 255.) In memory, each of our current integers is padded with 56 zero bits, seven whole bytes of zeros! We can glimpse this using a NumPy "view", which returns a different way of looking at the same memory:


In [ ]:
look_at_the_zeros = deep_and_wide_array.view(dtype=np.uint8)
print(look_at_the_zeros[0])

The above array, look_at_the_zeros, is a view of the same memory as our original deep_and_wide_array. This gets back to NumPy's memory efficiency: there are many, many operations in NumPy that reuse memory, views being one of them.

I actually learned something here! Look how the original numbers appear before the "leading" zeros. It turns out that's because Intel processors are what's called little-endian, which means that they store the least-significant bytes first.
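We can make the byte order explicit by forcing a little-endian dtype with the '<i8' spec (a sketch; the '<' prefix means little-endian, so the bytes below are the same on any host CPU):

```python
import numpy as np

# '<i8' is an explicitly little-endian 8-byte integer.
x = np.array([1], dtype='<i8')
print(x.view(np.uint8))  # [1 0 0 0 0 0 0 0]: least-significant byte first
```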

Exercise: To convince yourself that these are the same "memory cubes", try modifying the second location in memory in look_at_the_zeros, and then looking at the first "level" of deep_and_wide_array.


In [ ]:

Views, slices, copies

NumPy gives you fine-level control over whether you copy memory or access a different "view" of the same array. This has consequences for how much memory you're using.


In [ ]:
start_watching_memory(get_ipython())

In [ ]:
a = np.random.rand(1000, 1000)
b = a * 2
c = np.sqrt(b)

In [ ]:
a = np.random.rand(1000, 1000)
a *= 2
a = np.sqrt(a, out=a) # use 'a' as the output array as well

In [ ]:
half = a.shape[0] // 2  # integer division: array indices must be ints in Python 3
top_half, bottom_half = a[:half, :], a[half:, :]

In [ ]:
print("The top left value of a is originally {0}".format(a[0, 0]))
top_half[0, 0] = 0.9
print("top_half[0, 0] = 0.9")
print("The top left value of a is now {0}".format(a[0, 0]))

top_half is just a view of the same memory as a, so modifying top_half modifies a, and vice-versa. Sometimes this is good, sometimes it's bad. If you want a new copy of an array, use the np.copy function:
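When in doubt, you can check whether two arrays share memory with np.shares_memory, and inspect a view's parent through its .base attribute (a sketch on a small array):

```python
import numpy as np

a = np.random.rand(4, 4)
view = a[:2, :]           # basic slicing returns a view
copy = np.copy(a[:2, :])  # np.copy allocates fresh memory

print(np.shares_memory(a, view))  # True
print(np.shares_memory(a, copy))  # False
print(view.base is a)             # True: a view remembers its parent array
```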


In [ ]:
top_half, bottom_half = np.copy(a[:half, :]), np.copy(a[half:, :])

In [ ]:
print("The top left of a is originally {0}".format(a[0, 0]))
top_half[0, 0] = 0.1
print("top_half[0, 0] = 0.1")
print("The top left value of a is now {0}".format(a[0, 0]))

Slicing with a step also returns a view, even though the selected elements are discontiguous in memory:


In [ ]:
evens, odds = a[0::2, :], a[1::2, :]

In [ ]:
print("The first second-row value of a is originally {0}".format(a[1, 0]))
odds[0, 0] = 0.1
print("odds[0, 0] = 0.1")
print("The first second-row value of a is now {0}".format(a[1, 0]))

In [ ]:
stop_watching_memory(get_ipython())

Computing along axes

NumPy provides many convenience functions to compute, for example, the mean and variance of an array:


In [ ]:
np.mean(deep_and_wide_array)

In [ ]:
np.var(deep_and_wide_array)

In [ ]:
np.std(deep_and_wide_array)

One of the killer features of NumPy is its ability to compute these functions along axes:


In [ ]:
np.mean(deep_and_wide_array, axis=0)

In [ ]:
np.std(deep_and_wide_array, axis=0)

In [ ]:
np.mean(deep_and_wide_array, axis=1)

SciPy extends this convention with its own statistics library. For example, we can compute the 5th percentile along an axis:


In [ ]:
from scipy.stats import mstats
mstats.mquantiles(deep_and_wide_array, prob=[0.05], axis=0)

Uh oh! Looks like the mquantiles function is limited to 2D data. Can you think of a way around it?


In [ ]:

Quantile normalisation of expression data

This is a quick demonstration of using axis-wise computation and fancy indexing to perform a common (though I think slightly outdated) method of normalising expression data, quantile normalisation.

First, we load the data using np.loadtxt:


In [ ]:
expr = np.loadtxt('coriell_3s_data_20rows.tsv', delimiter='\t', skiprows=1,
                  usecols=np.arange(1, 7), dtype=int)
print(expr)

(Only columns 1 through 6 contain the actual expression data, and row 0 is just the sample header, so we skip it. Here we are only interested in the numerical data.)

Next, we take the log2 of 1 + the data, a common transformation for expression data. (Why we do this is beyond the scope of this tutorial!)


In [ ]:
logexpr = np.log2(expr + 1)
print(np.round(logexpr, 2)) # round to 2 decimal places just for pretty printing

Then, figure out the expression quantiles, that is, the average value of the nth-ranked observation for all n.


In [ ]:
log_quantiles = np.mean(np.sort(logexpr, axis=0), axis=1)
print(np.round(log_quantiles, 2))

Get the rank of each expression value in its own sample (column)


In [ ]:
ranks = np.argsort(np.argsort(expr, axis=0), axis=0)
print(ranks)
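The double-argsort idiom deserves a closer look. On a toy 1-D array (illustrative values), the first argsort gives the indices that would sort the data, and argsorting that result gives each element's rank:

```python
import numpy as np

x = np.array([30, 10, 20])
order = np.argsort(x)      # [1 2 0]: indices that would sort x
ranks = np.argsort(order)  # [2 0 1]: the rank of each element of x
print(ranks)
```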

Then, the coup de grâce, we use the ranks array itself to index quantiles, thus replacing each rank value with the quantile value:


In [ ]:
log_quantile_normalised = log_quantiles[ranks]
print(np.round(log_quantile_normalised, 2))

Finally, we convert the data back to integer counts:


In [ ]:
quantile_normalised = (2**log_quantile_normalised - 1).astype(int)
print(np.hstack((expr, np.zeros((19, 1), int), quantile_normalised))) # print side by side to compare
